Abstract

The question of polynomial learnability of probability distributions, particularly Gaussian 
mixture distributions, has recently received significant attention in theoretical computer science 
and machine learning. However, despite major progress, the general question of polynomial 
learnability of Gaussian mixture distributions still remained open. The current work resolves 
the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary 
fixed number of components. 

The result on learning Gaussian mixtures relies on an analysis of distributions belonging to
what we call polynomial families in low dimension. These families are characterized by their
moments being polynomial in parameters and include almost all common probability distributions
as well as their mixtures and products. Using tools from real algebraic geometry, we show that
parameters of any distribution belonging to such a family can be learned in polynomial time and
using a polynomial number of sample points. The result on learning polynomial families is quite
general and is of independent interest.

To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide 
a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a 
high-dimensional mixture to a polynomial number of parameter estimations in low dimension. 
Combining this reduction with the results on polynomial families yields our result on learning 
arbitrary Gaussian mixtures in high dimensions. 



1 Introduction 



Estimating parameters of a model from sampled data is one of the oldest and most general problems
of statistical inference. Given a number of samples, one needs to choose a distribution that best fits
the observed data. While theoretical analysis in the statistical literature has traditionally
concentrated on rates (e.g., minimax rates), in recent years other computational aspects of this problem,
especially the dependence on the dimension of the space, have attracted attention. In particular, a recent
line of work in the theoretical computer science and learning communities has been concerned with
learning the distribution in time, and using a number of samples, polynomial in the parameters and the
dimension of the space. This effort has been particularly directed at the family of Gaussian mixture
models due to their simple formulation and widespread use in applications spanning areas such as
computer vision, speech recognition, and many others (see, e.g., [18, 19, 23]). This line of research
started with the work of Dasgupta [10], who was the first to show that learning the parameters of
a Gaussian mixture distribution in time polynomial in the dimension of the space $n$ was possible at
all. This work has been refined and extended in a number of subsequent papers. The results in [10]
required separation between mixture components on the order of $\sqrt{n}$. That was later improved to
the order of $\Omega(n^{1/4})$ in [11] for mixtures of spherical Gaussians and in [2] for general Gaussians.
The separation requirement was further reduced and made independent of $n$: to the order of
$\Omega(k^{1/4})$ in [23] for a mixture of $k$ spherical Gaussians and, for log-concave distributions, to a bound
that is likewise independent of $n$ in [17]. In [1] the separation requirement was further reduced to
$\Omega(k + \sqrt{k \log n})$. An extension of PCA called isotropic PCA was introduced in [5] to learn mixtures
of Gaussians when any pair of Gaussian components is separated by a hyperplane having very small
overlap along the hyperplane direction (the so-called "pancake layering problem"). A number of recent
papers [6, 7, 8, 9, 13] addressed related problems, such as learning mixtures of product distributions
and heavy-tailed distributions.

Table 1: Partial summary of results on Gaussian mixture model learning. Note that [13] addresses
a somewhat different problem. The last two methods allow the separation between the means to
be zero, assuming different covariance matrices.

However, all of these papers assumed a minimum separation between the components, which is an
increasing function of the dimension $n$ and/or the number of components $k$. The general question
of learning parameters of a distribution without any separation conditions remained open. The
first result in that direction was obtained by Feldman et al. [14], who showed that the density
(but not the parameters) of mixtures of axis-aligned Gaussians can be learned in polynomial time
using the method of moments.

Very recently two papers [16, 4] independently addressed two special cases of Gaussian mixture
learning without a separation assumption. In Kalai et al. [16] the authors showed that a mixture of
two Gaussians with arbitrary covariance matrices can be learned in polynomial time. The technique
relies on a randomized algorithm to reduce the problem to one dimension. The key argument of
the paper is based on deconvolving the one-dimensional mixture to increase the separation between
the components and carefully analyzing the moments of the deconvolved mixture in order to apply
the method of moments. In [4] it is shown that a mixture of $k$ identical spherical Gaussians can be
learned in time polynomial in dimension. The key result is based on analyzing the Fourier transform
of the distribution in one dimension to give a lower bound on the norm. However, it is not clear
whether the techniques of either [16] or [4] could be applied to the general case with an arbitrary
number of components and covariance matrices.

In this paper we resolve the polynomial learnability problem by proving that there exists a polynomial
algorithm to estimate parameters of a general high-dimensional mixture with an arbitrary fixed
number of Gaussian components without any additional assumptions. Table 1 briefly summarizes
the progress in the area and our result.

Our main result for Gaussian mixtures relies on a quite general result of independent interest on 
learning what we call polynomial families. These families are characterized by their moments being 
polynomial in the parameters of a distribution. It turns out that almost all common distribution 
families, e.g., Gaussian, exponential, uniform, Laplace, binomial, Poisson and a number of others, 
(see Table 2 in Appendix A for a longer list and a description of their moments), as well as their
mixtures and (tensor) products have this property. Our technique uses methods of real algebraic
geometry and combines them with the classical method of moments (originally introduced by Pearson
in [20] to analyze Gaussian mixtures).

We note that there have been applications of algebraic geometry in the field of statistics, particularly
in conditional independence testing and likelihood estimation for discrete distributions and
exponential families (see, e.g., [12]). We note that a mixture of more than one Gaussian distribution is
a family of continuous distributions, which is not an exponential family.

Below we give a brief summary of the main results and the structure of the paper. 

Brief outline of the paper. 

Section 2. We start Section 2 by introducing the problem of parameter learning and defining the
notion of a polynomial family. We proceed to prove the main result showing that parameters of a
distribution from a polynomial family can be learned with confidence $1 - \delta$ up to precision $\epsilon$ using
a number of samples $\mathrm{poly}\left(\max\left(\frac{1}{\epsilon}, \frac{1}{\mathcal{R}}\right)\right)$, where $\mathcal{R}$ is the radius of identifiability, a measure of
intrinsic hardness of unique parameter identification for a distribution.¹ In fact, the result is more
general: even if the radius of identifiability is zero, parameters can still be learned up to a certain
equivalence relation defined in the paper.

The proof consists of two main steps. The first step uses the Hilbert basis theorem for an
appropriately defined ideal in the ring of polynomials to show that a fixed set of (possibly
high-dimensional) moments uniquely identifies the distribution.

In the second step, we pose the parameter estimation problem as a system of quantified algebraic
equations and inequalities using the finite set of moments obtained in the first step. We use
quantifier elimination for semi-algebraic sets (Tarski-Seidenberg theorem) to prove that there exists a
polynomial algorithm for parameter learning.

Section 3. In Section 3 we prove our main results on learning Gaussian mixture distributions
in high dimensions. The main difficulty is that the general results of Section 2 cannot be applied
directly since the number of parameters increases with the dimension of the space. To overcome
this issue, we prove that the Gaussian family has the property that we call polynomial reducibility.
That is, the parameters of a distribution in $n$ dimensions can be recovered from a $\mathrm{poly}(n)$ number of
low-dimensional projections. Specifically, we show that a mixture of Gaussians with $k$ components
can be recovered using a polynomial number of projections to $(2k^2 + 2)$-dimensional space. This
leads us to Theorem 3.1, our main result for parameter learning on Gaussian mixtures. We show
that parameters of a Gaussian mixture can be learned with precision $\epsilon$ and confidence $1 - \delta$, using
a number of samples polynomial in the dimension $n$, $\frac{1}{\delta}$ and $\max\left(\frac{1}{\epsilon}, \frac{1}{\mathcal{R}}\right)$. Moreover, we also provide
an explicit formula for the radius of identifiability of Gaussian mixtures. If we are given a priori
bounds on the minimum mixing weight and the minimum separation between the mean/covariance
pairs, that leads to an upper bound on $\frac{1}{\mathcal{R}}$. For example, our results hold even in the extreme
case where all components have the same mean, as long as the covariance matrices are different. In
Theorem 3.2 we also show that in the absence of such a lower bound, $\mathcal{R}$ can be estimated directly
from the data.

¹For example, it is impossible to identify mixing coefficients of a mixture of two Gaussians with identical means
and variances, thus in that case $\mathcal{R} = 0$. See Section 3 for the detailed analysis of Gaussian mixtures.

We discuss other polynomially reducible families, where a similar approach would yield results on 
polynomial learnability. 

In Section 4 we conclude and discuss some limitations of our results, directions of future work and
conjectures.

2 Learning Polynomial Families 

In this section we prove some general learnability results for a large class of probability distributions 
that we call polynomial families, which are characterized by the moments being polynomial functions 
of parameters. This class turns out to contain nearly all commonly used probability distributions, 
as well as their mixtures and (tensor) products. See Appendix A (Table 2) for a partial list together
with the description of their moments either explicitly or through a recurrence relation, as well as
some examples of families which are not polynomial (Table 3).

The main result in this section is Theorem 2.8, which shows that there exists an algorithm to learn
the parameters of a distribution from a polynomial family using a polynomial number of samples.

We start with the outline of the standard parameter learning problem. Let $p_\theta$,
$\theta = (\theta^1, \ldots, \theta^m)$, $\theta \in \Theta \subset \mathbb{R}^m$ be an $m$-parametric family of probability
distributions in $\mathbb{R}^l$. The problem of parameter learning is the following: given precision $\epsilon$ and
confidence $\delta$, and some number $n(\epsilon, \delta)$ of points sampled from $p_\theta$, we need to provide an estimate
$\hat{\theta}$, such that $\|\hat{\theta} - \theta\| < \epsilon$ with probability at least $1 - \delta$.

However, for many families identifying the values of parameters uniquely is impossible, due to the
fact that several different values of parameters may correspond to the same probability distribution.
Moreover, if two values of parameters, say, $\theta$ and $\omega$, are close to two values of parameters, $\theta'$ and
$\omega'$ respectively, which have identical probability distributions, then it may be hard to distinguish
between them. This situation is illustrated in Fig. 1. These observations suggest that a more general
formulation of learning distribution parameters needs to take these into account. A mathematical
formalization of the more general notion of learnability will be given in Eq. (1), which defines a notion
of a neighborhood taking parameters with identical probability distribution into account. An
$\epsilon$-"neighborhood" of $\theta$, $\mathcal{N}(\theta, \epsilon)$, is shown in gray in Fig. 1. We will also introduce the notion of the
radius of identifiability $\mathcal{R}(\theta)$ (Definition 2.9) to give a quantification of how hard it may be to identify
the parameters. For example, parameters $\theta$ for which $\mathcal{R}(\theta) = 0$ cannot be identified given any amount
of data. In Fig. 1 the radius of identifiability $\mathcal{R}(\theta)$ is equal to $\epsilon'$.

Figure 1: If $\theta$ and $\omega$ are close to two values of parameters $\theta'$ and $\omega'$ with identical probability
distributions, then it is hard to distinguish between them from sampled data, even when
$\|\theta - \omega\|$ is large.

For mixtures of Gaussians any permutation of the mixture components has the same distribution, 
while a component with zero mixing weight may have arbitrary mean/covariance. If two components 
have the same mean/covariance pair, then the mixing coefficients are not defined uniquely. However, 
assuming that the mean/variance pairs for any two components are different and that the mixing
coefficients are non-zero, the parameters are defined uniquely up to a permutation of components 
(see Section [2]). 

Our main Theorem 2.8 applies even when parameters of a probability distribution are not defined
uniquely, including the standard definition of parameter learning as a special case (see Corollary 2.10
and Corollary 2.11).

In Subsection 2.1 we prove the basic properties of polynomial families, including the key result,
Theorem 2.3, which shows that a finite set of moments uniquely determines the distribution.

In Subsection 2.2 we define the extended notion of a neighborhood $\mathcal{N}(\theta, \epsilon)$ and discuss its basic
properties. We proceed to obtain the main technical result, a lower bound in Theorem 2.5. This,
together with the upper bound in Proposition 2.7, allows us to set up a grid search to prove the
main Theorem 2.8. We also define the radius of identifiability, and derive Corollary 2.10 and
Corollary 2.11.

2.1 Polynomial Families and Finite Sets of Moments 

We start by assuming that the parameter set $\Theta$ is a compact semi-algebraic subset of $\mathbb{R}^m$. Recall
that a semi-algebraic set in $\mathbb{R}^m$ is a finite union of sets defined by a system of algebraic equations and
inequalities. A sphere, a polytope, the sets of symmetric and orthogonal matrices are all examples of
semi-algebraic sets. For example, a typical family of Gaussian mixture distributions with bounded
means and bounded (in norm) covariance matrices would satisfy this condition.

The family of semi-algebraic sets is closed under finite union, intersection and taking complements.
Importantly, the Tarski-Seidenberg theorem states that a linear projection of a semi-algebraic set
is also semi-algebraic. This is equivalent to the elimination of quantifiers for semi-algebraic sets,
which we will need shortly. See [3] for a review of results on real algebraic geometry.

Definition 2.1 (Polynomial family). We call the family $p_\theta$ a polynomial family, if each (raw
$l$-dimensional) moment $M_{i_1,\ldots,i_l}(\theta) = \int x_1^{i_1} \cdots x_l^{i_l} \, dp_\theta$ of the distribution exists and can be represented
as a polynomial of the parameters $(\theta^1, \ldots, \theta^m)$. We also require that each $p_\theta$ should be defined
uniquely by its moments.²

We will order the moments $M_{i_1,\ldots,i_l}$ lexicographically and denote them by $M_1(\theta), \ldots, M_n(\theta), \ldots$ In
the one-dimensional case this corresponds to the standard ordering of the moments.
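To make the definition concrete, the following minimal sketch generates the raw moments of a univariate Gaussian as explicit polynomials in its parameters. It assumes the parametrization $(\mu, s)$ with $s = \sigma^2$ and uses the standard recurrence $M_n = \mu M_{n-1} + (n-1)\, s\, M_{n-2}$; the function names and the dictionary representation of polynomials are illustrative choices, not notation from the paper.

```python
from collections import defaultdict

def gaussian_raw_moments(order):
    """Raw moments M_0..M_order of N(mu, s), s = sigma^2, as polynomials in (mu, s).

    Each polynomial is a dict {(deg_mu, deg_s): coefficient}.
    Uses the recurrence M_n = mu * M_{n-1} + (n - 1) * s * M_{n-2}.
    """
    moments = [{(0, 0): 1}]              # M_0 = 1
    if order >= 1:
        moments.append({(1, 0): 1})      # M_1 = mu
    for n in range(2, order + 1):
        poly = defaultdict(int)
        for (a, b), c in moments[n - 1].items():
            poly[(a + 1, b)] += c        # multiply M_{n-1} by mu
        for (a, b), c in moments[n - 2].items():
            poly[(a, b + 1)] += (n - 1) * c  # multiply M_{n-2} by (n-1)*s
        moments.append(dict(poly))
    return moments

def evaluate(poly, mu, s):
    """Evaluate a polynomial in the dict representation at the point (mu, s)."""
    return sum(c * mu**a * s**b for (a, b), c in poly.items())

moments = gaussian_raw_moments(4)
# moments[2] encodes mu^2 + s, moments[4] encodes mu^4 + 6 mu^2 s + 3 s^2
```

Every entry is a polynomial with integer coefficients, which is the property the definition asks for in this one-parameter-pair case.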

As it turns out, most of the common families of probability distributions are, in fact, polynomial 
(see Appendix A). Moreover, a mixture, a product or a linear transformation of polynomial families 
is also a polynomial family, as stated in the following 

Lemma 2.2. Let $p_\theta$, $\theta \in \Theta$ and $q_\omega$, $\omega \in \Omega$ be polynomial families. Then the following families are
also polynomial:

(a) the family $w_1 p_\theta + w_2 q_\omega$, $w_1, w_2 \ge 0$, $w_1 + w_2 = 1$.

(b) the family $p_\theta \times q_\omega (x, y) = p_\theta(x) \times q_\omega(y)$, $(\theta, \omega) \in \Theta \times \Omega$.

(c) the family $p_{A\theta}$, where $A \in \mathbb{R}^{m \times m}$ is a fixed matrix and $A\theta \in \Theta \subset \mathbb{R}^m$.

The proof follows directly from the linearity of the integral, Fubini's theorem and the fact that
polynomial functions stay polynomial under a linear change.
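Parts (a) and (b) of the lemma can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses univariate Gaussian components with parameters (mean, variance), and the helper names `gaussian_moment`, `mixture_moment` and `product_moment` are hypothetical. Mixture moments follow from linearity of the integral, and mixed moments of a product factor by Fubini's theorem.

```python
def gaussian_moment(n, mu, var):
    """n-th raw moment of N(mu, var) via M_n = mu*M_{n-1} + (n-1)*var*M_{n-2}."""
    prev, cur = 1.0, mu                  # M_0 and M_1
    if n == 0:
        return prev
    for k in range(2, n + 1):
        prev, cur = cur, mu * cur + (k - 1) * var * prev
    return cur

def mixture_moment(n, weights, components):
    """n-th raw moment of sum_i w_i N(mu_i, var_i), by linearity of the integral."""
    return sum(w * gaussian_moment(n, mu, var)
               for w, (mu, var) in zip(weights, components))

def product_moment(i, j, comp_x, comp_y):
    """Mixed moment E[x^i y^j] of a product of two Gaussians, by Fubini's theorem."""
    return gaussian_moment(i, *comp_x) * gaussian_moment(j, *comp_y)

# Equal-weight mixture of N(0, 1) and N(2, 1)
first = mixture_moment(1, [0.5, 0.5], [(0.0, 1.0), (2.0, 1.0)])
second = mixture_moment(2, [0.5, 0.5], [(0.0, 1.0), (2.0, 1.0)])
```

Since each component moment is polynomial in its own parameters, both constructions stay polynomial in the combined parameter vector.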

Note that a multivariate Gaussian distribution is a product of univariate Gaussians along the
principal directions of its covariance matrix. Since the standard coordinates can be transformed to
principal coordinates by a linear transformation, a multivariate Gaussian is a polynomial family.
Hence a general mixture of $k$ multivariate normal distributions in $\mathbb{R}^l$ is also a polynomial family
with $lk + \frac{1}{2} l(l+1)k + k - 1$ parameters.
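The parameter count can be checked term by term. This short sketch assumes the count reads $lk + \frac{1}{2} l(l+1)k + k - 1$, that is, $k$ mean vectors, $k$ symmetric covariance matrices and $k - 1$ free mixing weights; the function name is hypothetical.

```python
def gaussian_mixture_param_count(l, k):
    """Number of free parameters of a k-component Gaussian mixture in R^l."""
    means = l * k                        # k mean vectors of length l
    covariances = k * l * (l + 1) // 2   # k symmetric l x l matrices
    weights = k - 1                      # weights sum to one
    return means + covariances + weights

count_1d_two = gaussian_mixture_param_count(1, 2)
count_2d_three = gaussian_mixture_param_count(2, 3)
```

The count grows quadratically in $l$, which is why the results of this section are applied only in low dimension.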

Let us now recall that a family $p_\theta$ is called identifiable if $p_{\theta_1} \ne p_{\theta_2}$ for any $\theta_1 \ne \theta_2$. We will now
prove the following

Theorem 2.3. Let $p_\theta$ be a polynomial family of distributions. Then there exists a positive integer
$N$, such that $p_{\theta_1} = p_{\theta_2}$ if and only if $M_i(\theta_1) = M_i(\theta_2)$ for all $i = 1, \ldots, N$. In the case when the
family $p_\theta$ is identifiable, the first $N$ moments are sufficient to uniquely identify the parameter $\theta$.

Proof:

Since $p_\theta$ is a polynomial family, each $M_i(\theta)$ is a polynomial of $\theta$. Let $\theta_1 = (\theta_1^1, \ldots, \theta_1^m)$ and
$\theta_2 = (\theta_2^1, \ldots, \theta_2^m)$. Let

$$P_i(\theta_1^1, \ldots, \theta_1^m, \theta_2^1, \ldots, \theta_2^m) = M_i(\theta_1) - M_i(\theta_2)$$

be a polynomial of $2m$ variables. Now let $\mathcal{I}_j$ be the ideal in the ring of polynomials of $2m$ variables
generated by the polynomials $P_1, \ldots, P_j$. Thus we have an increasing sequence of ideals
$\mathcal{I}_1 \subset \mathcal{I}_2 \subset \mathcal{I}_3 \ldots$ Let $\mathcal{I} = \bigcup_j \mathcal{I}_j$. By the Hilbert basis theorem, the ideal $\mathcal{I}$ is finitely generated, which implies
that for some $N$ large enough, $\mathcal{I}_N$ contains all of the generators. Therefore for any $M > N$ we can
write

$$P_M(\theta_1, \theta_2) = \sum_{i=1}^{N} a_i(\theta_1, \theta_2) P_i(\theta_1, \theta_2)$$

for some polynomials $a_i$. Thus if $P_i(\theta_1, \theta_2) = 0$ for $i = 1, \ldots, N$ then $P_i(\theta_1, \theta_2) = 0$ for any $i$.
Recalling the definition of $P_M$, we conclude that all moments of $p_{\theta_1}$ and $p_{\theta_2}$ coincide if and only if
the first $N$ moments of these distributions are the same. Since the sequence of moments defines the
distribution uniquely, the statement of the theorem follows. □
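As a small worked instance of this argument, consider a single univariate Gaussian with parameters $(\mu, s)$, $s = \sigma^2$. The sketch below checks, with exact rational arithmetic at random points, that the third moment difference $P_3$ is a polynomial combination of $P_1$ and $P_2$, so it already lies in the ideal generated by the first two. The multipliers `a1` and `a2` were worked out by hand for this illustration and are not taken from the paper.

```python
from fractions import Fraction
import random

# theta_1 = (m1, s1), theta_2 = (m2, s2); moments of N(m, s) are
# M_1 = m, M_2 = m^2 + s, M_3 = m^3 + 3 m s.
def P1(m1, s1, m2, s2):
    return m1 - m2

def P2(m1, s1, m2, s2):
    return (m1**2 + s1) - (m2**2 + s2)

def P3(m1, s1, m2, s2):
    return (m1**3 + 3 * m1 * s1) - (m2**3 + 3 * m2 * s2)

# Hand-derived polynomial multipliers with P3 = a1 * P1 + a2 * P2.
def a1(m1, s1, m2, s2):
    return m1**2 - 2 * m1 * m2 - 2 * m2**2 + 3 * s1

def a2(m1, s1, m2, s2):
    return 3 * m2

rng = random.Random(0)

def random_fraction():
    return Fraction(rng.randint(-20, 20), rng.randint(1, 9))

points = [tuple(random_fraction() for _ in range(4)) for _ in range(200)]
identity_holds = all(
    P3(*p) == a1(*p) * P1(*p) + a2(*p) * P2(*p) for p in points
)
```

A random-point check is not a proof, but the identity can also be confirmed by expanding both sides. It matches the familiar fact that a Gaussian is determined by its first two moments.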

²This is true under some mild conditions, e.g., if the moment generating function converges in a neighborhood of
zero [15].

2.2 Learning Polynomial Families

We will now introduce a notion of an $\epsilon$-"neighborhood" of a point, which takes into account that
different parameters may have identical probability distribution. We proceed to prove the main
Theorem 2.8 and a few corollaries, showing that the standard parameter learning problem becomes
a special case of the result.

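Let $\mathcal{E}(\theta) = \{\omega \mid p_\omega = p_\theta\}$ be the set of parameters $\omega$ which have distributions same as $p_\theta$. We note
that the distributions corresponding to different values of parameters in the set $\mathcal{E}(\theta)$ are identical
and hence cannot be distinguished from each other given any amount of sampled data. We now
define

$$\mathcal{N}(\theta, \epsilon) = \{\omega \in \Theta \mid \exists\, \omega', \theta' \in \Theta,\ 0 < \epsilon' < \epsilon:\ \|\omega - \omega'\| < \epsilon',\ \omega' \in \mathcal{E}(\theta'),\ \|\theta' - \theta\| < \epsilon - \epsilon'\} \qquad (1)$$

In other words, $\omega$ belongs to $\mathcal{N}(\theta, \epsilon)$ if it is within $\epsilon' < \epsilon$ distance of a parameter value which has the
same probability distribution as a parameter value within $\epsilon - \epsilon'$ of $\theta$. This definition is illustrated
graphically in Fig. 1. We observe the following properties of $\mathcal{N}(\theta, \epsilon)$:

1. (Symmetry) If $\theta_1 \in \mathcal{N}(\theta_2, \epsilon)$ then $\theta_2 \in \mathcal{N}(\theta_1, \epsilon)$.

2. ($\epsilon$-ball) An $\epsilon$-ball $B(\theta, \epsilon)$ around $\theta$ is contained in $\mathcal{N}(\theta, \epsilon)$. If $B(\theta, \epsilon)$ is an identifiable family,
then $B(\theta, \epsilon) = \mathcal{N}(\theta, \epsilon)$.

3. (Equivalence) If $p_{\theta_1} = p_{\theta_2}$, then $\theta_1 \in \mathcal{N}(\theta_2, \epsilon)$ for any $\epsilon > 0$.

Thus $\mathcal{N}(\theta, \epsilon)$ can be viewed as an "$\epsilon$-ball" around $\theta$ taking probability distribution into account.
For example, values of parameters with identical probability distributions cannot be distinguished
by this metric, which is consistent with statistical identifiability.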
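The neighborhood is easiest to picture when the only ambiguity is a relabeling of mixture components. The sketch below makes that simplifying assumption explicitly: it takes $\mathcal{E}(\theta')$ to be exactly the set of component permutations of $\theta'$, which ignores degenerate cases such as zero weights or coinciding components. Because permutations preserve Euclidean distance, membership in the neighborhood then reduces to being within $\epsilon$ of some permutation of $\theta$. The function names and the tuple layout (mean, variance, weight) are hypothetical.

```python
import math
from itertools import permutations

def flat_distance(a, b):
    """Euclidean distance between two parameter vectors, each given as a
    list of per-component tuples (mean, variance, weight)."""
    return math.sqrt(sum((x - y) ** 2
                         for comp_a, comp_b in zip(a, b)
                         for x, y in zip(comp_a, comp_b)))

def in_neighborhood(omega, theta, eps):
    """Membership test for the eps-neighborhood of theta, ASSUMING the only
    parameters with identical distributions are component permutations."""
    k = len(theta)
    return any(flat_distance(omega, [theta[i] for i in perm]) < eps
               for perm in permutations(range(k)))

theta = [(0.0, 1.0, 0.5), (3.0, 1.0, 0.5)]
omega = [(3.05, 1.0, 0.5), (0.0, 1.0, 0.5)]   # theta with swapped labels, slightly perturbed

plain_distance = flat_distance(omega, theta)   # large in the ordinary metric
close_up_to_relabeling = in_neighborhood(omega, theta, 0.1)
```

Here `omega` is far from `theta` as a vector but describes almost the same mixture, which is the situation the extended neighborhood is meant to capture.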

Lemma 2.4. $\mathcal{N}(\theta, \epsilon)$ is an open semi-algebraic set.

Proof: $\mathcal{N}(\theta, \epsilon)$ is open since a sufficiently small open ball around any point $\omega \in \mathcal{N}(\theta, \epsilon)$ is also
contained in $\mathcal{N}(\theta, \epsilon)$. To see that it is semi-algebraic we recall that by Theorem 2.3 there exists an $N$,
such that $\theta_1 \in \mathcal{E}(\theta_2)$ if and only if

$$Q(\theta_1, \theta_2) = \sum_{i=1}^{N} (M_i(\theta_1) - M_i(\theta_2))^2 = 0 \qquad (2)$$

which is an algebraic condition. Hence, by applying the Tarski-Seidenberg theorem to eliminate the
existential quantifiers in Eq. (1) we see that $\mathcal{N}(\theta, \epsilon)$ is semi-algebraic. □

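Theorem 2.5 (Lower bound). Let $p_\theta$ be a polynomial family. There exist $N \in \mathbb{N}$ and $t > 0$, such
that for any sufficiently small $\epsilon > 0$ and any $\theta_1, \theta_2 \in \Theta$, if $|M_i(\theta_1) - M_i(\theta_2)| > \epsilon$ for at least one
$i \le N$, then $\theta_1 \notin \mathcal{N}(\theta_2, O(\epsilon^t))$.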

Proof:

Choose $N$ as in Theorem 2.3. We start by observing that we can replace the condition
$|M_i(\theta_1) - M_i(\theta_2)| > \epsilon$ by

$$Q(\theta_1, \theta_2) = \sum_{i=1}^{N} |M_i(\theta_1) - M_i(\theta_2)|^2 > N\epsilon^2$$

in the statement of the theorem. Since the existence of $t$ is not affected by the substitution of $N\epsilon^2$
instead of $\epsilon$, to simplify matters we will assume that $Q(\theta_1, \theta_2) > \epsilon$.

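From Theorem 2.3 we recall that if for some $i \le N$, $|P_i(\theta_1, \theta_2)| \ne 0$ then $p_{\theta_1} \ne p_{\theta_2}$. Let $\delta$ be a
positive real number. Consider the set $X = \{(\theta_1, \theta_2) \mid \theta_1 \in \mathcal{N}(\theta_2, \delta)\}$. From Lemma 2.4 and the fact
that the relationship $\theta_1 \in \mathcal{N}(\theta_2, \delta)$ is symmetric, it follows that $X$ is an open subset of $\Theta \times \Theta$.
Hence the set $\Theta \times \Theta - X = \{\theta_1, \theta_2 \in \Theta,\ \theta_1 \notin \mathcal{N}(\theta_2, \delta)\}$ is compact and since $Q(\theta_1, \theta_2) > 0$ for any
$(\theta_1, \theta_2) \in \Theta \times \Theta - X$ we have

$$\inf_{\theta_1, \theta_2 \in \Theta,\ \theta_1 \notin \mathcal{N}(\theta_2, \delta)} Q(\theta_1, \theta_2) > 0 \qquad (3)$$

By an argument following that in Lemma 2.4 we see that $X$ and hence its complement are
semi-algebraic sets.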

Consider now the set $S_\delta$, $\delta > 0$ given by the following expression

$$S_\delta = \{\epsilon > 0 \mid \forall\, \theta_1, \theta_2 \in \Theta\ (\theta_1 \notin \mathcal{N}(\theta_2, \delta)) \Rightarrow Q(\theta_1, \theta_2) > \epsilon\}. \qquad (4)$$

Since these logical statements can be expressed as semi-algebraic conditions, by the Tarski-Seidenberg
theorem $S_\delta$ is a semi-algebraic subset of $\mathbb{R}$. Let $\epsilon(\delta) = \inf S_\delta^{c}$, the infimum of the complement of
$S_\delta$. From Eq. (3) we have that $\epsilon(\delta) > 0$ for any positive $\delta$. Since the number $\epsilon(\delta) > 0$ is easily written
using quantifiers and algebraic conditions, the Tarski-Seidenberg theorem implies that it is a
semi-algebraic set and hence satisfies some algebraic equation³ whose coefficients are polynomial in $\delta$.

We write this polynomial as $q(x) = q_M(\delta) x^M + \ldots + q_0(\delta)$, such that $q(\epsilon(\delta)) = 0$. We can assume that
$q_0(\delta)$ is not identically zero (dividing by an appropriate power of $x$ if necessary). From Lemma 2.6
we see that if $q(\epsilon(\delta)) = 0$ then

$$\epsilon(\delta) \ge \min\left(\frac{|q_0(\delta)|}{\sum_{i=1}^{M} |q_i(\delta)|},\ 1\right).$$

The last quantity is a ratio of two polynomials in $\delta$ and can thus be lower bounded by $C\delta^{t'}$, so
that $\epsilon(\delta) > C\delta^{t'}$ for some $t' > 0$, when $\delta$ is sufficiently small.

Putting $t = \frac{1}{t'}$ and recalling the definition of $S_\delta$, we see that $Q(\theta_1, \theta_2) < \epsilon$ implies $\theta_1 \in \mathcal{N}(\theta_2, O(\epsilon^t))$,
which completes the proof of the theorem. □

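Lemma 2.6. Let $\delta$ be a positive root of the polynomial $q(x) = a_M x^M + \ldots + a_0$, $a_0 \ne 0$. Then
$\delta \ge \min\left(\frac{|a_0|}{\sum_{i=1}^{M} |a_i|},\ 1\right)$.

Proof: We have $\delta \left(\sum_{i=1}^{M} a_i \delta^{i-1}\right) = -a_0$. For $0 < \delta < 1$ we have $\sum_{i=1}^{M} a_i \delta^{i-1} \le \sum_{i=1}^{M} |a_i|$, and the
statement follows. □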
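A quick numerical check of the bound in Lemma 2.6 is sketched below on one example polynomial, $q(x) = 2x^2 + 3x - 2 = (2x - 1)(x + 2)$, whose positive root is $\frac{1}{2}$. The function name and the coefficient ordering (constant term first) are assumptions of this illustration.

```python
def positive_root_lower_bound(coeffs):
    """Lower bound min(|a_0| / sum_{i>=1} |a_i|, 1) on any positive root of
    q(x) = a_0 + a_1 x + ... + a_M x^M, given coeffs = [a_0, a_1, ..., a_M]."""
    a0, rest = coeffs[0], coeffs[1:]
    if a0 == 0:
        raise ValueError("the lemma requires a nonzero constant term")
    return min(abs(a0) / sum(abs(a) for a in rest), 1.0)

def evaluate_polynomial(coeffs, x):
    """Evaluate q(x) for coeffs = [a_0, a_1, ..., a_M]."""
    return sum(a * x**i for i, a in enumerate(coeffs))

coeffs = [-2, 3, 2]            # q(x) = 2x^2 + 3x - 2
root = 0.5                     # positive root of q
bound = positive_root_lower_bound(coeffs)
```

For this polynomial the bound evaluates to $\min(2/5, 1)$, which sits below the actual root as the lemma requires.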



³Note that strict inequalities alone cannot define a set consisting of a single point.

Proposition 2.7 (Upper bound). Let $p_\theta$ be a polynomial family. For any $N \in \mathbb{N}$ there exists a
$C > 0$, such that

$$\sum_{i=1}^{N} |M_i(\theta_1) - M_i(\theta_2)|^2 \le C \|\theta_1 - \theta_2\|^2.$$

If $\Theta$ is contained in a ball of diameter $B$, then $C$ is bounded from above by a polynomial of $B$.

Proof: To prove the claim it is sufficient to show that each summand $|M_i(\theta_1) - M_i(\theta_2)|^2$ is bounded
from above by $C'\|\theta_1 - \theta_2\|^2$, which is equivalent to proving that $\frac{|M_i(\theta_1) - M_i(\theta_2)|}{\|\theta_1 - \theta_2\|} \le \sqrt{C'}$. We now
observe that by the mean value theorem

$$\frac{|M_i(\theta_1) - M_i(\theta_2)|}{\|\theta_1 - \theta_2\|} \le \sup_{\theta \in \Theta} \|\mathrm{grad}(M_i)(\theta)\|$$

where grad is the gradient of the function $M_i$. Since $M_i$ is a polynomial, all elements of the vector
$\mathrm{grad}(M_i)$ are polynomial in $\theta$. Therefore

$$\sup_{\theta \in \Theta} \|\mathrm{grad}(M_i)(\theta)\| \le C'' B^t$$

where $t$ is the maximum degree of these polynomials and $C''$ is an appropriate constant. This
implies the statement of the Proposition. □

Now we have the following: 

Theorem 2.8. There exists an algorithm, which, given $\epsilon > 0$ and $1 > \delta > 0$ and $P\left(\frac{1}{\epsilon}, \frac{1}{\delta}, B\right)$
samples from $p_\theta$, $\theta \in \Theta$, where $\Theta$ is the set of parameters within a ball of radius $B$ and $P$ is a
polynomial depending only on the distribution family, outputs $\hat{\theta}$, s.t. $\hat{\theta} \in \mathcal{N}(\theta, \epsilon)$ with probability at
least $1 - \delta$. The algorithm also requires a polynomial number of operations.

Proof: From Theorem 2.5 it follows that there exist $N \in \mathbb{N}$ and $t > 0$, such that if
$|M_i(\hat{\theta}) - M_i(\theta)| < \epsilon^t$ for all $i = 1, \ldots, N$, then $\hat{\theta} \in \mathcal{N}(\theta, \epsilon)$. Thus it is sufficient to estimate each moment
within $O(\epsilon^t)$. From Lemma D.1 (moment estimation) this can be done with probability $1 - \delta$ given a
number of sample points $\mathrm{poly}\left(\frac{1}{\epsilon^t}, \frac{1}{\delta}\right) = \mathrm{poly}\left(\frac{1}{\epsilon}, \frac{1}{\delta}\right)$ by computing the empirical moments of the sample.
Once we have precise estimates of the first $N$ moments a simple grid search suffices to find the
corresponding values of parameters. Indeed, suppose that $\Theta$ is contained in a ball of radius $B$ in $\mathbb{R}^m$.
Then the desired estimate can be obtained by conducting a grid search over a rectangular grid whose
resolution is polynomial in $\epsilon$ and $\frac{1}{B}$ and invoking Proposition 2.7. We see that the number of
operations is polynomial in $\frac{1}{\epsilon}$ and the main theorem is proved. □
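The two stages of the proof, empirical moment estimation followed by a grid search, can be sketched on the simplest polynomial family. The example below is a toy illustration under stated assumptions, not the algorithm's actual constants: it uses a single univariate Gaussian with parameters (mean, variance), matches only the first two moments, and picks the sample size, grid range and grid spacing arbitrarily. All function names are hypothetical.

```python
import random

def empirical_moments(samples, n_moments):
    """Empirical raw moments of orders 1..n_moments."""
    n = len(samples)
    return [sum(x**i for x in samples) / n for i in range(1, n_moments + 1)]

def model_moments(mu, var):
    """First two raw moments of N(mu, var), both polynomial in (mu, var)."""
    return [mu, mu * mu + var]

def grid_search(target, mu_grid, var_grid):
    """Return the grid point whose model moments are closest to the target."""
    best, best_err = None, float("inf")
    for mu in mu_grid:
        for var in var_grid:
            err = sum((m - t) ** 2 for m, t in zip(model_moments(mu, var), target))
            if err < best_err:
                best, best_err = (mu, var), err
    return best

rng = random.Random(0)
true_mu, true_var = 1.0, 0.25
samples = [rng.gauss(true_mu, true_var ** 0.5) for _ in range(20000)]

target = empirical_moments(samples, 2)
mu_grid = [-2.0 + 0.05 * i for i in range(81)]    # means in [-2, 2]
var_grid = [0.05 * i for i in range(1, 41)]       # variances in (0, 2]
mu_hat, var_hat = grid_search(target, mu_grid, var_grid)
```

The returned estimate can be no more accurate than the grid spacing, which mirrors the role of the upper bound in Proposition 2.7 in choosing a fine enough grid.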

To simplify further discussion we will now define the radius of identifiability: 

Definition 2.9. As before let $p_\theta$, $\theta \in \Theta$ be a family of probability distributions. For each $\theta$ we
define the radius of identifiability as follows

$$\mathcal{R}(\theta) = \sup\{r > 0 \mid \forall\, \theta_1 \ne \theta_2,\ (\|\theta_1 - \theta\| < r,\ \|\theta_2 - \theta\| < r) \Rightarrow (p_{\theta_1} \ne p_{\theta_2})\}$$

In other words, $\mathcal{R}(\theta)$ is the largest number, such that the open ball of radius $\mathcal{R}(\theta)$ around $\theta$
intersected with $\Theta$ is an identifiable (sub)family of probability distributions. If no such ball exists,
$\mathcal{R}(\theta) = 0$.

From Theorem 2.8 and the definition of the radius of identifiability we have the following

Corollary 2.10. There exists an algorithm, such that, given $\epsilon > 0$, for any identifiable $\theta \in \Theta$,
where $\Theta$ is the set of parameters within a ball of radius $B$, it outputs $\hat{\theta}$ within $\min(\epsilon, \mathcal{R}(\theta))$ of $\theta$ with
probability $1 - \delta$, using a number of sample points from $p_\theta$ polynomial in $\max\left(\frac{1}{\epsilon}, \frac{1}{\mathcal{R}(\theta)}\right)$, $\frac{1}{\delta}$ and $B$.

Corollary 2.11. More generally, if $\theta \in \Theta$, where $\Theta$ is the set of parameters within a ball of radius
$B$, is not identifiable but $\mathcal{E}(\theta) = \{\theta_1, \ldots, \theta_k\}$ is a finite set, there exists an algorithm, such that,
given $\epsilon > 0$, it outputs $\hat{\theta}$ within $\min(\epsilon, \min_j \mathcal{R}(\theta_j))$ of $\theta_i$ for some $i \in \{1, \ldots, k\}$ with probability
$1 - \delta$, using a number of sample points from $p_\theta$ polynomial in $\max\left(\frac{1}{\epsilon}, \frac{1}{\min_j \mathcal{R}(\theta_j)}\right)$, $B$ and $\frac{1}{\delta}$.

This last result is what we need to analyze the Gaussian mixture model in the next section.

Remark: It is important to note that the radius of identifiability depends on the choice of the
parameter set $\Theta$. Specifically, the radius is a decreasing function on the family of sets ordered by inclusion.



3 Gaussian Distributions and Polynomially Reducible High Dimensional Families

The main result of this section is to show that there exists an algorithm for estimating parameters of 
high-dimensional Gaussian mixture distributions in time polynomial in the dimension n and other 
parameters. We note that the techniques from the previous section cannot be applied directly to 
high-dimensional distributions since the number of parameters generally increases with dimension. 
Instead our approach will be to show that parameters of high-dimensional Gaussians can be esti- 
mated using poly(n) linear projections to linear subspaces, whose dimension is independent of n. 
We will call this property polynomial reducibility and will also briefly discuss some other families 
satisfying this condition later in the section. 
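The idea of recovering high-dimensional parameters from low-dimensional projections can be illustrated in its simplest form. The sketch below is an illustration for a single Gaussian only, using coordinate projections: the marginal on a pair of coordinates exposes the corresponding entries of the mean and covariance, so all pairs together determine every parameter. The mixture case requires more care and is not addressed by this sketch. The function names are hypothetical.

```python
from itertools import combinations

def project(mean, cov, coords):
    """Parameters of the marginal of N(mean, cov) on the given coordinates."""
    sub_mean = [mean[a] for a in coords]
    sub_cov = [[cov[a][b] for b in coords] for a in coords]
    return sub_mean, sub_cov

def reassemble(n, projections):
    """Rebuild (mean, cov) in R^n from a dict {coords: (sub_mean, sub_cov)}."""
    mean = [None] * n
    cov = [[None] * n for _ in range(n)]
    for coords, (sub_mean, sub_cov) in projections.items():
        for p, a in enumerate(coords):
            mean[a] = sub_mean[p]
            for q, b in enumerate(coords):
                cov[a][b] = sub_cov[p][q]
    return mean, cov

n = 3
mean = [1.0, -2.0, 0.5]
cov = [[2.0, 0.3, 0.1],
       [0.3, 1.0, -0.2],
       [0.1, -0.2, 1.5]]

# n(n-1)/2 two-dimensional projections, a polynomial number in n
projections = {pair: project(mean, cov, pair) for pair in combinations(range(n), 2)}
recovered_mean, recovered_cov = reassemble(n, projections)
```

Each low-dimensional problem has a number of parameters independent of $n$, which is the feature that lets low-dimensional results be applied.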

We will now specifically discuss the case of a mixture of Gaussian distributions. Let
$p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$ be a mixture of $k$ Gaussian distributions in $\mathbb{R}^n$, with means $\mu_i$ and covariance
matrices $\Sigma_i$. Let us consider the parameters of the distribution $\theta = (\mu_1, \Sigma_1, w_1, \ldots, \mu_k, \Sigma_k, w_k)$ as a
single vector (thus flattening the covariance matrices). We take the usual Euclidean distance in this
space (which, in fact, corresponds to the Frobenius distance for the covariance matrices).

We will assume that the number of components k is fixed. We note that any permutation of the 
mixture components leads to the same density function and hence cannot be identified from data. 
On the other hand, it is well known ([22]) that the density of the distribution determines the 
parameters uniquely up to a permutation, if and only if any two components with the same means 
have different covariance matrices and no mixing coefficient is equal to zero. 

The main result of the section is given by the following 

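Theorem 3.1. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, $\theta \in \Theta$, where $\Theta$ is the set of parameters within a ball of
radius $B$, be a mixture of Gaussian distributions in $\mathbb{R}^n$ with radius of identifiability $\mathcal{R}(\theta)$. Then there
exists an algorithm, which, given $\epsilon > 0$ and $1 > \delta > 0$, and $\mathrm{poly}\left(n, \max\left(\frac{1}{\epsilon}, \frac{1}{\mathcal{R}(\theta)}\right), \frac{1}{\delta}, B\right)$ samples
from $p_\theta$, with probability greater than $(1 - \delta)$, outputs a parameter vector
$\hat{\theta} = ((\hat{\mu}_1, \hat{\Sigma}_1, \hat{w}_1), \ldots, (\hat{\mu}_k, \hat{\Sigma}_k, \hat{w}_k)) \in \Theta$, such that there exists a permutation
$\sigma : \{1, 2, \ldots, k\} \to \{1, 2, \ldots, k\}$ satisfying

$$\sum_{i=1}^{k} \left( \|\hat{\mu}_i - \mu_{\sigma(i)}\|^2 + \|\hat{\Sigma}_i - \Sigma_{\sigma(i)}\|^2 + |\hat{w}_i - w_{\sigma(i)}|^2 \right) \le \epsilon^2.$$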

We note that the radius of identifiability $\mathcal{R}(\theta)$ can be calculated explicitly from Proposition 3.3:

$$(\mathcal{R}(\theta))^2 = \min\left( \frac{1}{4} \min_{i \ne j} \left( \|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2 \right),\ \min_i w_i^2 \right)$$

Thus if the mean/variance pairs for any two components are different with difference bounded from
below and the minimum mixing weight is also bounded from below, then we have an explicit lower
bound for $\mathcal{R}(\theta)$.
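The explicit formula is straightforward to evaluate. The sketch below takes the formula to be $(\mathcal{R}(\theta))^2 = \min\left(\frac{1}{4}\min_{i \ne j}(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2), \min_i w_i^2\right)$ with the Frobenius norm on covariance matrices, as in the parametrization above; the function names and the nested-list representation of means and covariances are assumptions of this illustration.

```python
import math
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two mean vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sq_frobenius(a, b):
    """Squared Frobenius distance between two covariance matrices."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def radius_of_identifiability(means, covs, weights):
    """Evaluate the explicit formula for a mixture with at least two components."""
    k = len(weights)
    pair_term = min(sq_dist(means[i], means[j]) + sq_frobenius(covs[i], covs[j])
                    for i, j in combinations(range(k), 2))
    weight_term = min(w * w for w in weights)
    return math.sqrt(min(pair_term / 4.0, weight_term))

# Well-separated means, equal weights: the weight term dominates.
r_separated = radius_of_identifiability(
    [[0.0], [2.0]], [[[1.0]], [[1.0]]], [0.5, 0.5])

# Identical means but different variances: the radius is still positive.
r_same_mean = radius_of_identifiability(
    [[0.0], [0.0]], [[[1.0]], [[2.0]]], [0.9, 0.1])
```

The second example shows the case highlighted in the introduction: components may share a mean as long as their covariance matrices differ.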

In fact, even when $\mathcal{R}(\theta)$ is not known in advance, it can be estimated from data:

Theorem 3.2. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, $\theta \in \Theta$, where $\Theta$ is the set of parameters within a ball
of radius $B$, be a mixture of Gaussian distributions in $\mathbb{R}^n$ with radius of identifiability $\mathcal{R}(\theta)$. Then
there exists an algorithm, which, given $\epsilon > 0$ and $1 > \delta > 0$, and $\mathrm{poly}\left(n, \frac{1}{\epsilon}, \frac{1}{\delta}, B\right)$ samples from $p_\theta$,
outputs whether $\mathcal{R}(\theta) < \epsilon$ with probability greater than $1 - \delta$.

□ 

The rest of the section is structured as follows: 

In Subsection 3.1 we discuss various properties of Gaussian mixture distributions. In particular we
derive the formula for the radius of identifiability (Proposition 3.3) and show that there exists a
low-dimensional projection such that the radius of identifiability changes by at most a linear factor
(Theorem [321) • 

In Subsection 3.2 we give a sketch for the proof of the main theorem, showing how the parameters
of a high-dimensional distribution can be estimated from a polynomial number of projections. The
details of the proof as well as the proof of Theorem 3.2 are given in the appendix.

Finally, we note that our results apply to high-dimensional distributions which are not mixtures of 
Gaussians with a fixed number of components. For example, a product of n 1-dimensional Gaussian 
mixture distributions, which is a Gaussian mixture distribution in $n$ dimensions with $k^n$ components,
can be easily learned using our methods. The same applies to other product distributions whose 
components are polynomial families. 



3.1 Gaussian Distributions 



Proposition 3.3. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, $\theta \in \Theta$ be a family of mixtures of Gaussian
distributions in $\mathbb{R}^n$ with non-zero mixing weights. Then the following inequality is satisfied:

$$(\mathcal{R}(\theta))^2 \ge \min\left( \frac{1}{4} \min_{i \ne j} \left( \|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2 \right),\ \min_i w_i^2 \right). \qquad (5)$$

Moreover, suppose $\Theta$ is a convex set⁴ such that it contains all possible mixing coefficients $(w_1, \ldots, w_k)$
for any fixed set of means and variances.⁵

In this case the inequality becomes an equality:

$$(\mathcal{R}(\theta))^2 = \min\left( \frac{1}{4} \min_{i \ne j} \left( \|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2 \right),\ \min_i w_i^2 \right) \qquad (6)$$

In particular, the radius of identifiability is invariant under the permutation of components.

PROOF: We start by proving inequality (5). Suppose that the distributions $p_{\theta'}$ and $p_{\theta''}$ have the same density. To prove the inequality, we need to show that at least one of $\theta'$, $\theta''$ is no closer to $\theta$ than the right-hand side of inequality (5).

Let us first consider the case when there is no pair $i \neq j$ such that $\mu'_i = \mu''_j$ and $\Sigma'_i = \Sigma''_j$. In that case at least one of the mixing coefficients of one of the mixtures must be equal to zero. That implies that either $\|\theta - \theta'\| \geq \min_i w_i$ or $\|\theta - \theta''\| \geq \min_i w_i$, which is consistent with inequality (5).

Alternatively, suppose that for some $i \neq j$ we have $(\mu'_i, \Sigma'_i) = (\mu''_j, \Sigma''_j)$. Put $v' = (\mu'_i, \Sigma'_i) = (\mu''_j, \Sigma''_j)$, $v_1 = (\mu_i, \Sigma_i)$, $v_2 = (\mu_j, \Sigma_j)$. We see that

$$\|\theta' - \theta\|^2 + \|\theta'' - \theta\|^2 \geq \|v' - v_1\|^2 + \|v' - v_2\|^2 \geq \frac{1}{2}\|v_1 - v_2\|^2 = \frac{1}{2}\|\mu_i - \mu_j\|^2 + \frac{1}{2}\|\Sigma_i - \Sigma_j\|^2.$$

Therefore, $\max\{\|\theta' - \theta\|^2, \|\theta'' - \theta\|^2\} \geq \frac{1}{4}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$, which is again consistent with inequality (5) and, together with the first case, implies the inequality.

To show Eq. (6) we need to observe that the bound is tight. Again we consider two possible cases. If the minimum in the right-hand side of Eq. (6) is equal to the square of one of the mixing weights, say $w_i$, construct $\theta'$ by putting $w'_i = 0$ and keeping the rest of the parameters of $\theta$. We see that $\|\theta' - \theta\| = w_i$. By slightly perturbing $\mu'_i$, we see that there exists a $\theta''$ arbitrarily close (but not equal) to $\theta'$ with the same probability density. Thus the radius of identifiability cannot exceed $w_i$.

Alternatively, the minimum in the right-hand side of Eq. (6) could be equal to $\frac{1}{4}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$ for some $i \neq j$. Construct $\theta'$ by putting $\mu'_i = \mu'_j = \frac{1}{2}(\mu_i - \mu_j)$ and $\Sigma'_i = \Sigma'_j = \frac{1}{2}(\Sigma_i - \Sigma_j)$ and keeping the rest of the parameters of $\theta$. It is easy to see that $\|\theta' - \theta\|^2 = \frac{1}{4}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$. Note that $\theta' \in \Theta$ by the convexity condition. By perturbing $w_i$ and $w_j$ slightly, and keeping the rest of the parameters fixed, we can obtain $\theta''$ arbitrarily close to $\theta'$ with the same probability density. Hence the radius of identifiability does not exceed $\frac{1}{4}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$, which completes the proof.
□
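The closed form in Eq. (6) is straightforward to evaluate numerically. The following is a minimal sketch, assuming the parameters are given as plain nested lists and that the norm on covariance matrices is the Frobenius norm (the Euclidean norm of the flattened matrix); the function names are illustrative and do not come from the paper.

```python
import math
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two flat sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flatten(matrix):
    """Flatten a nested-list matrix row by row."""
    return [entry for row in matrix for entry in row]

def radius_of_identifiability(means, covs, weights):
    """Evaluate the right-hand side of Eq. (6) and return its square root.

    Assumes at least two components; the covariance norm is taken to be
    the Frobenius norm, which is an assumption of this sketch.
    """
    pair_term = min(
        sq_dist(means[i], means[j]) + sq_dist(flatten(covs[i]), flatten(covs[j]))
        for i, j in combinations(range(len(means)), 2)
    )
    weight_term = min(w ** 2 for w in weights)
    return math.sqrt(min(pair_term / 4.0, weight_term))

# Two components in R^2 with equal covariances and means at distance 2.
means = [[0.0, 0.0], [2.0, 0.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
weights = [0.5, 0.5]
r = radius_of_identifiability(means, covs, weights)
```

In this example the pairwise term equals 4/4 = 1 while the smallest squared weight is 0.25, so the weights determine the radius.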

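From the discussion above we have the following:

Corollary 3.4. Let $\Theta$ be a convex set such that for any $\theta \in \Theta$ all mixing coefficients $w_i$ are nonzero. Then

$$(\mathcal{R}(\theta))^2 = \frac{1}{4} \min_{i \neq j} \left( \|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2 \right). \qquad (7)$$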



Footnote (convexity of $\Theta$ in Proposition 3.3): Note that requiring convexity is natural, since the set of positive definite matrices is a convex cone.
Footnote (mixing coefficients in Proposition 3.3): This requirement is unnecessarily strong; however, the precise condition, evident from the proof, is awkward to state.
It is also easy to see that the radius of identifiability satisfies a type of triangle inequality and that under any permutation of (mean, covariance matrix, mixing weight) triples the radius of identifiability does not change. This is expressed in the following two lemmas (the straightforward proofs are omitted):

Lemma 3.5. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, $\theta \in \Theta$, be a family of mixtures of Gaussian distributions in $\mathbb{R}^n$. For any $\theta_1, \theta_2 \in \Theta$ such that $\theta_1 \in \mathcal{B}(\theta_2, \epsilon)$ for some $\epsilon > 0$, $|\mathcal{R}(\theta_1) - \mathcal{R}(\theta_2)| < \epsilon$.

Lemma 3.6. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, $\theta \in \Theta$, be a mixture of Gaussian distributions in $\mathbb{R}^n$. Suppose $\theta$ is represented as $\theta = (\theta_1, \theta_2, \dots, \theta_k)$, where $\theta_i = (\mu_i, \Sigma_i, w_i)$ is the (mean, covariance matrix, mixing weight) triple. Let $\theta^{\sigma} = (\theta_{\sigma(1)}, \theta_{\sigma(2)}, \dots, \theta_{\sigma(k)})$, where $\sigma : \{1, 2, \dots, k\} \to \{1, 2, \dots, k\}$ is a permutation. Then $\mathcal{R}(\theta) = \mathcal{R}(\theta^{\sigma})$.

From now on, we will assume that $\Theta$ is a sufficiently large ball or cube (with the necessary conditions to make $p_\theta$ a valid probability distribution), so that we do not have to worry about convexity and other technical properties.

We now recall that a projection of a Gaussian mixture distribution onto a subspace is a lower-dimensional Gaussian mixture distribution. Specifically, if $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, a Gaussian mixture distribution in $\mathbb{R}^n$, is projected onto a subspace $S$, then the projection is a lower-dimensional Gaussian mixture distribution family $\pi_S(p_\theta)$, parameterized by $P_S(\theta)$. In particular, if $S$ is a coordinate plane then $P_S$ is a projection operator, which is the identity mapping for the mixing weights, an orthogonal projection onto $S$ for the means, and the restriction operator for the covariance matrices of the components, where each covariance matrix is projected to its minor corresponding to the coordinates in $S$.
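The action of $P_S$ on the parameters for a coordinate plane can be written in a few lines. This is a sketch under the assumption that means and covariance matrices are stored as nested lists and that the coordinate plane is given by a list of 0-based coordinate indices; the name `project_mixture` is hypothetical.

```python
def project_mixture(means, covs, weights, coords):
    """Apply P_S for the coordinate plane spanned by the indices in `coords`.

    Weights are left unchanged, each mean keeps only the selected
    coordinates, and each covariance matrix is restricted to the minor
    formed by the selected rows and columns.
    """
    proj_means = [[mu[c] for c in coords] for mu in means]
    proj_covs = [[[cov[r][c] for c in coords] for r in coords] for cov in covs]
    return proj_means, proj_covs, list(weights)

# One component in R^3 projected onto the plane of coordinates 0 and 2.
means = [[1.0, 2.0, 3.0]]
covs = [[[1.0, 0.1, 0.2],
         [0.1, 2.0, 0.3],
         [0.2, 0.3, 3.0]]]
weights = [1.0]
pm, pc, pw = project_mixture(means, covs, weights, [0, 2])
```

Here `pm` holds the projected mean `[1.0, 3.0]` and `pc` the 2 x 2 minor built from rows and columns 0 and 2.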

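We now state the following theorem, whose proof can be found in Appendix C.

Theorem 3.7. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, $\theta \in \Theta$, be a Gaussian mixture distribution in $\mathbb{R}^n$ with radius of identifiability $\mathcal{R}(\theta)$. Then there exists a $2k^2$-dimensional coordinate plane $S$ such that

$$\mathcal{R}(P_S(\theta)) \geq \frac{1}{n} \mathcal{R}(\theta).$$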

3.2 Sketch of the Proof of Theorem 3.1

We present a brief overview of the proof. The technical details can be found in Appendix C. The main idea is to show that the parameters of a high-dimensional Gaussian mixture can be estimated arbitrarily well using poly(n) projections onto coordinate subspaces whose dimension depends only on $k$. Since the dimension of these lower-dimensional subspaces is independent of $n$, the results from Section 2 can be used to estimate the parameters.

Let $\theta = (\theta_1, \theta_2, \dots, \theta_k)$, where $\theta_i = (\mu_i, \Sigma_i, w_i)$, be the parameter vector after flattening the covariance matrices. Recall that projection of $p_\theta$ onto a $2k^2$-coordinate plane $T$ will result in a mixture $\pi_T(p_\theta)$, parameterized (with a slight abuse of notation) by $P_T(\theta) = (P_T(\theta_1), P_T(\theta_2), \dots, P_T(\theta_k))$.

Step 1: Let $\mathcal{R}(\theta)$ be the radius of identifiability. Theorem 3.7 guarantees the existence of a $2k^2$-dimensional coordinate subspace $S$ such that the radius of identifiability decreases by at most a factor of $\frac{1}{n}$, that is, $\mathcal{R}(P_S(\theta)) \geq \frac{1}{n}\mathcal{R}(\theta)$.
To identify such a subspace, we take all $\binom{n}{2k^2}$ coordinate projections. For each projection onto a subspace $T$ we estimate the parameters using Theorem 2.8. It is important to note that, given $\epsilon'$ as input, Theorem 2.8 is guaranteed to produce a value of the parameter $\widehat{P_T(\theta)}$ such that $|\mathcal{R}(\widehat{P_T(\theta)}) - \mathcal{R}(P_T(\theta))| < \epsilon'$ (Lemma 3.5) using a number of samples polynomial in $k$ and $\frac{1}{\epsilon'}$. Applying the union bound over all $\binom{n}{2k^2}$ projections provides an estimate of the radius of identifiability for each projection within $\epsilon'$. Choosing $\epsilon'$ appropriately (the precise value is specified in Appendix C), and choosing the projection with the largest estimated radius of identifiability, yields a coordinate subspace $S$ with a lower-bounded radius of identifiability.

The coordinates within $S$ are represented by the horizontally shaded region in Figure 2. We use this space as a starting point for Step 2.
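The selection rule of Step 1 can be illustrated on exact parameters. The sketch below is not the paper's estimator: it assumes the true parameters are available (whereas the paper estimates each projection from samples via Theorem 2.8), uses the closed form of Eq. (6) for the radius, and takes the subspace dimension `dim` as a free argument standing in for $2k^2$ so that a small example stays tractable. All function names are hypothetical.

```python
import math
from itertools import combinations

def radius(means, covs, weights):
    """Square root of the right-hand side of Eq. (6), Frobenius norm assumed."""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    def flat(m):
        return [e for row in m for e in row]
    pair = min(sq(means[i], means[j]) + sq(flat(covs[i]), flat(covs[j]))
               for i, j in combinations(range(len(means)), 2))
    return math.sqrt(min(pair / 4.0, min(w * w for w in weights)))

def project(means, covs, coords):
    """Restrict means and covariance matrices to the coordinates in `coords`."""
    pm = [[mu[c] for c in coords] for mu in means]
    pc = [[[cov[r][c] for c in coords] for r in coords] for cov in covs]
    return pm, pc

def best_coordinate_plane(means, covs, weights, dim):
    """Enumerate all coordinate planes of dimension `dim`, keep the largest radius."""
    n = len(means[0])
    def projected_radius(coords):
        pm, pc = project(means, covs, coords)
        return radius(pm, pc, weights)
    best = max(combinations(range(n), dim), key=projected_radius)
    return best, projected_radius(best)

# Two components in R^4 whose means differ only in the last coordinate.
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
means = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]
best, r = best_coordinate_plane(means, [identity, identity], [0.5, 0.5], dim=2)
```

Any plane that omits coordinate 3 collapses the two components and has radius zero, so the selected plane must contain that coordinate.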

Step 2: By applying Corollary 2.11 to the projection $P_S(\theta)$, we can estimate the mixing weights, the projections of the original means, and the $2k^2 \times 2k^2$ minors of the covariance matrices corresponding to the coordinates within $S$. We now need to estimate the rest of the parameters using a sample size polynomial in $n$. We do this by estimating each additional coordinate separately. That is, for each coordinate $i$ not in $S$ we take $S_i = \mathrm{span}\{S, e_i\}$, where $e_i$ is the corresponding coordinate vector. It can be seen that the radius of identifiability does not decrease going from $S$ to $S_i$. We show that the $i$-th coordinate of each component mean can be estimated by applying Corollary 2.11 to the projection onto $S_i$. We repeat this procedure for each of the $n - 2k^2$ coordinates not in $S$.

To estimate the covariance matrices we proceed similarly, except that we need to estimate entries corresponding to pairs of coordinates $(i, j)$. Now we have two possibilities, since either one of $i, j$ or both of them may not be in $S$. If exactly one of them, say $i$, is not in $S$, the projection onto $S_i$ defined above can be used to estimate the corresponding entry of each covariance matrix. If both $i, j$ are not in $S$, we take the projection onto $S_{ij} = \mathrm{span}\{S, e_i, e_j\}$. By applying Corollary 2.11 we show that the $ij$-th entry of the covariance matrices can also be estimated.

Figure 2: Estimation of high-dimensional Gaussian mixture parameters from poly(n) lower-dimensional projections.

Thus, after obtaining the initial space $S$, the complete set of parameters can be estimated using at most $n - 2k^2 + \binom{n - 2k^2}{2}$ parameter estimations for $(2k^2 + 1)$- or $(2k^2 + 2)$-dimensional subspaces.

This procedure is shown graphically in Figure 2.
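The counts of projections quoted in Steps 1 and 2 are easy to tabulate. The following sketch simply evaluates the two binomial expressions from the text; the function name is illustrative.

```python
from math import comb

def projection_counts(n, k):
    """Return (step1, step2) projection counts for dimension n and k components.

    step1: number of 2k^2-dimensional coordinate planes tried in Step 1.
    step2: number of further estimations after S is fixed, one per
           coordinate outside S plus one per pair of such coordinates.
    """
    d = 2 * k * k
    if n < d:
        raise ValueError("the sketch assumes n >= 2k^2")
    outside = n - d
    return comb(n, d), outside + comb(outside, 2)

step1, step2 = projection_counts(n=10, k=2)
```

For fixed $k$ both counts are polynomial in $n$, which is what the sample-complexity argument relies on.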
4 Conclusion and Discussion



The results of this paper resolve the general problem of polynomial learning of Gaussian mixture distributions. Our results do not require any separation assumptions and apply as long as the mixture is identifiable. For example, they apply even if all components of the mixture have the same mean, as long as the covariance matrices are different and the mixing coefficients are non-zero.

The proof brings the techniques of algebraic geometry to the classical method of moments, an approach that, as far as we know, is new to this domain. We also provide quite general results applicable to learning various low-dimensional families, and some observations on high-dimensional families going beyond Gaussian mixture distributions with a fixed number of components. For example, one can also learn products of arbitrary probability distributions in a fixed low-dimensional polynomial family, e.g., a product of $n$ $d$-dimensional Gaussian mixtures with $k$ components each (which is an $nd$-dimensional Gaussian mixture distribution with $k^n$ components).

We plan to investigate other applications of the framework presented in this paper to learning. We also note that the methods proposed in the paper can be turned into implementable (and potentially practical) algorithms through the use of tools from computational algebraic geometry. This is also a direction of future investigation.
Appendix



A Some Polynomial Families of Distributions



In this appendix we provide a partial list of the moment expressions of various univariate probability distributions which form polynomial families. It turns out that most of the commonly used distributions form polynomial families, as shown in Table 2. In the fifth column of Table 2 we provide either an expression for the $i$-th moment or a recurrence relation, which shows that the moments are polynomial in the distribution parameters, along with explicit expressions for the first three moments. These moment expressions and recurrence relations are well known and can be found in, e.g., [15, 21]. In a couple of cases we need a slightly different parameterization, instead of the standard one, to ensure that the moments are polynomial in the new parameters. For example, in the standard parameterization, the Negative Binomial distribution $NB(r, p)$ is expressed by the probability mass function $\binom{x + r - 1}{x} (1 - p)^r p^x$. However, if we replace $p$ by a new parameter $m = \frac{p}{1 - p}$, then the moments are polynomial in $r$ and $m$. The recurrence relation for this new parameterization can be obtained following the same steps as in [21]. In Table 3 we list two families which are not polynomial.
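As a quick check that a recurrence of this kind produces moments that are polynomial in the parameters, the Gaussian recurrence $E(X^i) = \mu E(X^{i-1}) + (i-1)\sigma^2 E(X^{i-2})$ from Table 2 can be run in exact arithmetic. This is a small sketch; the function name is illustrative.

```python
from fractions import Fraction

def gaussian_raw_moments(mu, sigma, order):
    """Raw moments E(X^0), ..., E(X^order) of N(mu, sigma^2) via the recurrence
    E(X^i) = mu * E(X^(i-1)) + (i - 1) * sigma^2 * E(X^(i-2))."""
    mu, sigma = Fraction(mu), Fraction(sigma)
    moments = [Fraction(1), mu]
    for i in range(2, order + 1):
        moments.append(mu * moments[i - 1] + (i - 1) * sigma ** 2 * moments[i - 2])
    return moments[:order + 1]

mu, sigma = 2, 3
m = gaussian_raw_moments(mu, sigma, 3)
```

The values in `m` can be compared against the closed forms $\mu$, $\mu^2 + \sigma^2$ and $\mu^3 + 3\mu\sigma^2$ listed in the table.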



B Separation Preserving Coordinate Planes

Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$ be a mixture of $k$ Gaussian distributions in $\mathbb{R}^n$, with means $\mu_i$ and covariance matrices $\Sigma_i$. When this distribution is projected onto any lower-dimensional coordinate plane $S$, the corresponding Gaussian mixture $\pi_S(p_\theta)$, parameterized by $P_S(\theta)$, has means and covariance matrices represented by $P_S(\mu_i)$ and $P_S(\Sigma_i)$ respectively. We first show that if any pair of means or pair of covariance matrices of the original component Gaussian distributions are separated, then they remain so after projecting the mixture distribution onto some suitable lower-dimensional coordinate plane.

Existence of a Coordinate Plane where Projected Means Remain Separated:

Lemma B.1. For any $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$, there exists a $k^2$-coordinate plane $S$ such that

$$\forall i, j: \quad \|P_S(\mu_i) - P_S(\mu_j)\| \geq \frac{\|\mu_i - \mu_j\|}{\sqrt{n}}.$$

Proof: We will use $\mathcal{M}$ to denote a set of indices of coordinate directions of $\mathbb{R}^n$ and let $S_{\mathcal{M}}$ be the $|\mathcal{M}|$-coordinate plane, where $|\mathcal{M}|$ is the cardinality of $\mathcal{M}$, spanned by the coordinate directions whose indices are in $\mathcal{M}$. Initially $\mathcal{M}$ is empty. Let $A_i = \{1, 2, \dots, i-1, i+1, \dots, k\}$. Now consider the pair consisting of $\mu_1$ and any other $\mu_j$ such that $j \in A_1$. There exists at least one coordinate direction, whose index is say $m$, such that $|\mu_{1,m} - \mu_{j,m}| \geq \frac{\|\mu_1 - \mu_j\|}{\sqrt{n}}$. Adding $m$ to $\mathcal{M}$, and projecting onto $S_{\mathcal{M}}$, guarantees that $\|P_{S_{\mathcal{M}}}(\mu_1) - P_{S_{\mathcal{M}}}(\mu_j)\| \geq \frac{\|\mu_1 - \mu_j\|}{\sqrt{n}}$. Note that for $\mu_1$, in the worst case, we may have to include indices of $(k - 1)$ extra coordinate directions in $\mathcal{M}$ to ensure that after projection onto $S_{\mathcal{M}}$, $\|P_{S_{\mathcal{M}}}(\mu_1) - P_{S_{\mathcal{M}}}(\mu_j)\| \geq \frac{\|\mu_1 - \mu_j\|}{\sqrt{n}}$ for any $j \in A_1$.
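Distribution | $\theta$ | Pdf/Pmf $f(x; \theta)$ | Mgf $M(t)$ | Moments expression and first three moments
Gaussian | $\mu, \sigma$ | $\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $e^{\mu t + \frac{1}{2}\sigma^2 t^2}$ | $E(X^i) = \mu E(X^{i-1}) + (i-1)\sigma^2 E(X^{i-2})$; $E(X) = \mu$, $E(X^2) = \mu^2 + \sigma^2$, $E(X^3) = \mu^3 + 3\mu\sigma^2$
Uniform | $a, b$ | $\frac{1}{b-a}$, $a \leq x \leq b$ | $\frac{e^{tb} - e^{ta}}{t(b-a)}$ | $E(X^i) = \frac{1}{i+1}\sum_{j=0}^{i} a^j b^{i-j}$; $E(X) = \frac{a+b}{2}$, $E(X^2) = \frac{a^2 + ab + b^2}{3}$, $E(X^3) = \frac{a^3 + a^2 b + a b^2 + b^3}{4}$
Gamma | $\beta, m$ | $\frac{x^{m-1} e^{-x/\beta}}{\beta^m \Gamma(m)}$, $x \geq 0$ | $(1 - \beta t)^{-m}$ | $E(X^i) = m(m+1)\cdots(m+i-1)\beta^i$; $E(X) = m\beta$, $E(X^2) = m(m+1)\beta^2$, $E(X^3) = m(m+1)(m+2)\beta^3$
Laplace | $\mu, b$ | $\frac{1}{2b} e^{-\frac{|x-\mu|}{b}}$ | $\frac{e^{\mu t}}{1 - b^2 t^2}$ | $E(X^i) = \sum_{j=0}^{i} \frac{i!}{(i-j)!} b^j \mu^{i-j} \mathbf{1}\{j \text{ is even}\}$; $E(X) = \mu$, $E(X^2) = \mu^2 + 2b^2$, $E(X^3) = \mu^3 + 6\mu b^2$
Exponential | $\lambda$ | $\frac{1}{\lambda} e^{-x/\lambda}$, $x \geq 0$ | $(1 - \lambda t)^{-1}$ | $E(X^i) = i!\,\lambda^i$; $E(X) = \lambda$, $E(X^2) = 2\lambda^2$, $E(X^3) = 6\lambda^3$
Chi-Square | $k$ | $\frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\Gamma(k/2)}$, $x \geq 0$ | $(1 - 2t)^{-k/2}$ | $E(X^i) = k(k+2)\cdots(k+2i-2)$; $E(X) = k$, $E(X^2) = k(k+2)$, $E(X^3) = k(k+2)(k+4)$
Inverse Gaussian | $\mu, \lambda$ | $\sqrt{\frac{1}{2\pi\lambda x^3}}\, e^{-\frac{(x-\mu)^2}{2\lambda\mu^2 x}}$, $x > 0$ | $e^{\frac{1 - \sqrt{1 - 2\lambda\mu^2 t}}{\lambda\mu}}$ | $E(X^i) = (2i-3)\lambda\mu^2 E(X^{i-1}) + \mu^2 E(X^{i-2})$; $E(X) = \mu$, $E(X^2) = \lambda\mu^3 + \mu^2$, $E(X^3) = 3\lambda^2\mu^5 + 3\lambda\mu^4 + \mu^3$
Poisson | $\lambda$ | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $e^{\lambda(e^t - 1)}$ | $E(X^i) = \lambda E(X^{i-1}) + \lambda \frac{\partial E(X^{i-1})}{\partial \lambda}$; $E(X) = \lambda$, $E(X^2) = \lambda^2 + \lambda$, $E(X^3) = \lambda^3 + 3\lambda^2 + \lambda$
Binomial | $n, p$ | $\binom{n}{x} p^x (1-p)^{n-x}$ | $(1 - p + p e^t)^n$ | $E(X^i) = np E(X^{i-1}) + p(1-p)\frac{\partial E(X^{i-1})}{\partial p}$; $E(X) = np$, $E(X^2) = n(n-1)p^2 + np$, $E(X^3) = (n^3 - 3n^2 + 2n)p^3 + 3n(n-1)p^2 + np$
Geometric | $p$ | $\left(1 - \frac{1}{p}\right)^x \frac{1}{p}$ | $\frac{1}{p - (p-1)e^t}$ | $E(X^i) = \sum_{j=0}^{\infty} j^i \left(1 - \frac{1}{p}\right)^j \frac{1}{p}$; $E(X) = p - 1$, $E(X^2) = (p-1)(2p-1)$, $E(X^3) = (p-1)(6p^2 - 6p + 1)$
Negative Binomial | $r, m$ | $\binom{x+r-1}{r-1} \frac{m^x}{(m+1)^{r+x}}$ | $\left(\frac{1}{m + 1 - m e^t}\right)^r$ | $E(X^i) = rm E(X^{i-1}) + m(m+1)\frac{\partial E(X^{i-1})}{\partial m}$; $E(X) = rm$, $E(X^2) = r(r+1)m^2 + rm$, $E(X^3) = (r^3 + 3r^2 + 2r)m^3 + 3r(r+1)m^2 + rm$

Table 2: Common polynomial families and their moments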



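Distribution | $\theta$ | Pdf/Pmf $f(x; \theta)$ | Mgf $M(t)$ | Moments expression
Weibull | $k, \lambda$ | $\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}$, $x \geq 0$ | $\sum_{n=0}^{\infty} \frac{t^n \lambda^n}{n!}\Gamma\left(1 + \frac{n}{k}\right)$ | $E(X^i) = \lambda^i \Gamma\left(1 + \frac{i}{k}\right)$
Cauchy | $\lambda, \gamma$ | $\frac{1}{\pi\gamma} \cdot \frac{1}{1 + \left(\frac{x - \lambda}{\gamma}\right)^2}$ | Does not exist | Does not exist

Table 3: Examples of some probability distributions that do not belong to a polynomial family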



Similarly, to ensure that after projecting onto $S_{\mathcal{M}}$, $P_{S_{\mathcal{M}}}(\mu_2)$ is guaranteed to remain separated from any $P_{S_{\mathcal{M}}}(\mu_j)$, $j \in A_2$, by at least $\frac{\|\mu_2 - \mu_j\|}{\sqrt{n}}$, we may need to add indices of $(k - 2)$ additional coordinate directions to $\mathcal{M}$, and so on. So in total $\mathcal{M}$ can have indices of at most $(k-1) + (k-2) + \dots + 1 = \frac{k(k-1)}{2} < k^2$ coordinate directions. This ensures that, as long as we project $\{\mu_i\}_{i=1}^{k}$ onto $\binom{n}{k^2}$ different $k^2$-coordinate planes, there exists at least one $k^2$-coordinate plane $S$ such that $\forall i, j$, $\|P_S(\mu_i) - P_S(\mu_j)\| \geq \frac{\|\mu_i - \mu_j\|}{\sqrt{n}}$. □
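The counting argument in this proof corresponds to a simple greedy construction: for every pair of means, keep the coordinate in which they differ the most. The sketch below follows that idea; it is an illustration rather than the paper's procedure (which enumerates coordinate planes), and the function name is hypothetical.

```python
import math
from itertools import combinations

def separating_coordinates(means):
    """For each pair of means, add the coordinate with the largest gap.

    The largest coordinate gap is at least ||mu_i - mu_j|| / sqrt(n), so the
    projected distance onto the chosen coordinates is at least that large.
    At most k(k-1)/2 coordinates are selected.
    """
    n = len(means[0])
    chosen = set()
    for i, j in combinations(range(len(means)), 2):
        gaps = [abs(a - b) for a, b in zip(means[i], means[j])]
        chosen.add(max(range(n), key=lambda m: gaps[m]))
    return sorted(chosen)

means = [[0.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 5.0],
         [0.0, 2.0, 0.0, 0.0]]
coords = separating_coordinates(means)
```

For these three means in $\mathbb{R}^4$, only two coordinates are needed to keep every pair separated by the factor $1/\sqrt{n}$.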

Existence of a Coordinate Plane where Projected Covariance Matrices Remain Separated:

Lemma B.2. For any $\Sigma_1, \Sigma_2, \dots, \Sigma_k \in \mathbb{R}^{n \times n}$, there exists a $k^2$-coordinate plane $S$ such that

$$\forall i, j: \quad \|P_S(\Sigma_i) - P_S(\Sigma_j)\| \geq \frac{\|\Sigma_i - \Sigma_j\|}{n}.$$

Proof: We will use $\mathcal{M}$ to denote a set of indices of coordinate directions of $\mathbb{R}^n$ and let $S_{\mathcal{M}}$ be the $|\mathcal{M}|$-coordinate plane, where $|\mathcal{M}|$ is the cardinality of $\mathcal{M}$, spanned by the coordinate directions whose indices are in $\mathcal{M}$. Initially $\mathcal{M}$ is empty. Let $A_i = \{1, 2, \dots, i-1, i+1, \dots, k\}$. Now consider the pair consisting of $\Sigma_1$ and any other $\Sigma_j$ such that $j \in A_1$. Since $\Sigma_1$ and $\Sigma_j$ must differ in at least one diagonal or off-diagonal element by an amount of at least $\frac{\|\Sigma_1 - \Sigma_j\|}{n}$, there must exist two coordinate directions, whose indices are say $p$ and $q$, such that adding $p$ and $q$ to $\mathcal{M}$ and projecting onto $S_{\mathcal{M}}$ guarantees that $\|P_{S_{\mathcal{M}}}(\Sigma_1) - P_{S_{\mathcal{M}}}(\Sigma_j)\| \geq \frac{\|\Sigma_1 - \Sigma_j\|}{n}$. Note that for $\Sigma_1$, in the worst case, we may have to add indices of $2(k - 1)$ extra coordinate directions to $\mathcal{M}$ to ensure that after projecting onto $S_{\mathcal{M}}$, $\|P_{S_{\mathcal{M}}}(\Sigma_1) - P_{S_{\mathcal{M}}}(\Sigma_j)\| \geq \frac{\|\Sigma_1 - \Sigma_j\|}{n}$ for any $j \in A_1$.

Similarly, to ensure that after projecting onto $S_{\mathcal{M}}$, $P_{S_{\mathcal{M}}}(\Sigma_2)$ is guaranteed to remain separated from any $P_{S_{\mathcal{M}}}(\Sigma_j)$, $j \in A_2$, by at least $\frac{\|\Sigma_2 - \Sigma_j\|}{n}$, we may need to add indices of $2(k - 2)$ additional coordinate directions to $\mathcal{M}$, and so on. So in total $\mathcal{M}$ can have indices of at most $2(k-1) + 2(k-2) + \dots + 2 \cdot 1 = k(k-1) < k^2$ coordinate directions. This ensures that, as long as we project $\{\Sigma_i\}_{i=1}^{k}$ onto $\binom{n}{k^2}$ different $k^2$-coordinate planes, there exists at least one $k^2$-coordinate plane $S$ such that $\forall i, j$, $\|P_S(\Sigma_i) - P_S(\Sigma_j)\| \geq \frac{\|\Sigma_i - \Sigma_j\|}{n}$. □
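The same greedy idea applies to covariance matrices, where each pair contributes the row and column index of the entry with the largest difference. The sketch below assumes the Frobenius norm and nested-list matrices; the function name is hypothetical, and the construction only illustrates the counting argument.

```python
import math
from itertools import combinations

def separating_coordinates_cov(covs):
    """For each pair of matrices, add the row and column index of the entry
    with the largest absolute difference.

    That entry differs by at least ||Sigma_i - Sigma_j||_F / n, and it survives
    restriction to any coordinate plane containing both indices.
    At most 2 * k(k-1)/2 = k(k-1) coordinates are selected.
    """
    n = len(covs[0])
    cells = [(r, c) for r in range(n) for c in range(n)]
    chosen = set()
    for i, j in combinations(range(len(covs)), 2):
        p, q = max(cells, key=lambda rc: abs(covs[i][rc[0]][rc[1]] - covs[j][rc[0]][rc[1]]))
        chosen.update((p, q))
    return sorted(chosen)

a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]
coords = separating_coordinates_cov([a, b])
```

Here the two matrices differ only in the (0, 2) and (2, 0) entries, so the selected coordinates are 0 and 2.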



C Proof of Theorem 3.1, Theorem 3.2 and Theorem 3.7

In this appendix we give the detailed proof of Theorem 3.1 as well as the proofs of Theorem 3.2 and Theorem 3.7. We start with some preliminary lemmas.
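Lemma C.1. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, where $\Theta$ is the set of parameters within a ball of radius $B$, be a mixture of Gaussian distributions in $\mathbb{R}^n$ with radius of identifiability $\mathcal{R}(\theta)$. If $\theta$ is represented as $\theta = (\theta_1, \theta_2, \dots, \theta_k)$, where $\theta_i = (\mu_i, \Sigma_i, w_i)$ is the (mean, covariance matrix, mixing weight) triple (after flattening the covariance matrices), then for any $i \neq j$, $\|\theta_i - \theta_j\| \geq 2\mathcal{R}(\theta)$.

Proof: An explicit expression for $\mathcal{R}(\theta)$ is given in Equation (6). If $\frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) \leq \min_i w_i^2$, then for any $i \neq j$, $(\mathcal{R}(\theta))^2 = \frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) \leq \frac{1}{4}\|\theta_i - \theta_j\|^2$. On the other hand, if $\frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) > \min_i w_i^2$, then for any $i \neq j$, $(\mathcal{R}(\theta))^2 = \min_i w_i^2 < \frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) \leq \frac{1}{4}\|\theta_i - \theta_j\|^2$. □

Lemma C.2. Let $p_\theta = \sum_{i=1}^{k} w_i N(\mu_i, \Sigma_i)$, where $\Theta$ is the set of parameters within a ball of radius $B$, be a mixture of Gaussian distributions in $\mathbb{R}^n$ with radius of identifiability $\mathcal{R}(\theta)$. Let $S$ and $T$ be two lower-dimensional subspaces such that $S \subset T$. Then $\mathcal{R}(P_T(\theta)) \geq \mathcal{R}(P_S(\theta))$.

PROOF: Immediate from Equation (6). □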
Proof of Theorem 3.1:

Let $\theta = (\theta_1, \theta_2, \dots, \theta_k)$, where $\theta_i = (\mu_i, \Sigma_i, w_i)$, be the parameter vector after flattening the covariance matrices. Recall that projection of $p_\theta$ onto any $2k^2$-coordinate plane $T$ will result in a mixture $\pi_T(p_\theta)$, which is parameterized (with a slight abuse of notation) by $P_T(\theta) = (P_T(\theta_1), P_T(\theta_2), \dots, P_T(\theta_k))$.

Details of Step 1: Let 7 = min (^—^, where ^{9) is the radius of identifiability. Theorem 13.71 

guarantees the existence of a 2A:^-dimensional coordinate subspace S, such that radius of identifi- 
ability decreases by at most ^, SS{Ps{9)) > ^^{9) > 7. To identify such a subspace, we take all 
(2^2) coordinate projections. For any fixed projection to a 2fc^-dimensional subspace T, invoking 
Theorem 12.81 using a sample of size poly(^, ^,B), (setting the precision parameter to ^) produces 

a value of parameters Pt{9) such that \^{Pt{9)) - ^{Pt{9))\ < ^ (Lemma [33]). Applying the 
union bound for all (2^2) projections provides an estimate for the radius of identifiability for each 
projection within ^. Thus invoking Theorem 12.81 (2^2) times, each time using a sample of size 

poly ^i, -pyl^, -B^ , (setting the precision parameter to ^) and choosing the projection with the 
largest estimated radius of identifiability, yields a coordinate subspace S such that with probability 
at least 1 - |, ^{Ps{9)) > |. Clearly the sample size requirement for this step is polynomial in n. 

Details of Step 2: By applying Corollary 12.111 to the mixture irsiPe), where S is obtained in 
Step 1, using a sample of size poly(^, |, B), (setting the precision parameter to ^) with probability 

greater than 1 — | we can get an estimate of Ps{9) satisfying ||P5(0) — Ps{9)\\ < ^. Note that these 
estimates encompass the mixing weights, projections of the original means and 2/c^ x 2/c^ minors of 
the covariance matrices corresponding to the coordinates within S. If we let 9 to be Ps{9) then the 
estimate 9' = Ps{9) is, up to a permutation, within ^ of with probability greater than (1 — j). 
Note that the dimension of ^ is {k — 1) + k (^2k'^ + ^^Li^—tll^ , These parameters are represented 
by the horizontally shaded region in Figure [2j 

We now need to estimate the rest of the parameters using a sample size polynomial in $n$. This procedure is explained in the following two sub-steps.

2a: Estimating means and part of covariance matrices 
In this sub-step we estimate each additional coordinate separately. That is, for each coordinate $i$ not in $S$ we take $S_i = \mathrm{span}\{S, e_i\}$, where $e_i$ is the corresponding coordinate vector. It can be seen that the radius of identifiability does not decrease going from $S$ to $S_i$. We will show that the $i$-th coordinate of each component mean, the $i$-th diagonal entry of each component covariance matrix, and $2k^2$ extra off-diagonal entries of each component covariance matrix can be estimated by applying Corollary 2.11 to the projection onto $S_i$. We repeat this procedure for each of the $n - 2k^2$ coordinates not in $S$.

For each such n — 2k^ coordinates not in S, we project pe onto Si and invoke the algorithm of 
Corollarv l2.11l (setting the precision parameter to be ^) using a sample of size poly jjj^, -B^ 

Clearly this sample size is polynomial in n. This ensures that, each time we get an estimate Psi{G) 
such that with probability at least 1 - |, ||^^^) - Ps^{0)\\ < |- Since PsiO) C Ps.iO) letting 
(pi = Psii^) \ Ps{(^) to be the extra parameters, we have for each i, 

\\^,-cp,\\ = \\i\{e)-PsM\\<l 

with probability greater than (1— |), where (f> is the estimate of (p. Since for each Si, Vm^„, \\Psi (dm) — 
PSi{(^n)\\ ^ (using Lemma IC.ll and Lemma IC.2p . estimates of the extra parameters can be 
uniquely associated to the parameters of the component Gaussian distributions estimated in Step 
2. 

Letting O" to be U^~^^ (pi, we have 



n-2A:2 



n 



with probability greater than (1 — |), where §" is the estimate of O" . 

Note that the dimension of $\theta''$ is $k(n - 2k^2)(2 + 2k^2)$, where each $P_{S_i}(\theta) \setminus P_S(\theta)$ encompasses the $i$-th coordinate of each component mean, the $i$-th diagonal entry of each component covariance matrix, and $2k^2$ extra off-diagonal entries of each component covariance matrix. These parameters are represented by the diagonally shaded region in Figure 2.

2b: Estimating the remaining entries of covariance matrices 

To estimate the remaining parameters of the covariance matrices we need to estimate entries corresponding to pairs of coordinates $(i, j)$ when both $i$ and $j$ are not in $S$. We take the projection onto $S_{ij} = \mathrm{span}\{S, e_i, e_j\}$. It can be seen as before that the radius of identifiability does not decrease going from $S$ to $S_{ij}$. By applying Corollary 2.11 we will show that the $ij$-th entry of the covariance matrices can be estimated. Since there are $\binom{n - 2k^2}{2}$ such projections, we repeat this procedure $\binom{n - 2k^2}{2}$ times; each time we project $p_\theta$ onto the appropriate $S_{ij}$ and invoke the algorithm of Corollary 2.11

(setting the precision parameter to ^) using a sample of size poly (^^ , jjj^^ , . Clearly this 
sample size is polynomial in n. This ensures that, each time we get an estimate Psij (0) such that 
with probability at least 1 — \\PSij{(^) — PSij{(^)\\ < g- Since Ps{9) C Ps^^iO) in each case and 
there are ("~2 ^ ) such cases, letting ij)t,t = 1, . . . , ("~2 ) to be the extra parameters in each case , 
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we have for each t, HV't — V'tll ^ 9 with probabihty greater than (1 — |), where 'tpt is the estimate of 
ipf As before estimates of these extra parameters can be uniquely associated to the parameters of 
the component Gaussian distributions estimated in Step 2. Letting O'" to be the k(^~'^^ ) covariance 

parameters that have not been estimates in the previous steps, we have 6 C tpt, and in 

particular, 



< 

V 



\ t=i 



ill' 




where §'" is the estimate of 9", with probability greater than (1 — |). The parameters represented 
by 0"' are shown in the vertically shaded region of Figure [2j 

In Step 1 we need to invoke Theorem 2.8 $\binom{n}{2k^2}$ times. In Step 2 we need to invoke Corollary 2.11 $1 + (n - 2k^2) + \binom{n - 2k^2}{2}$ times. Thus the total number of invocations of Theorem 2.8 and Corollary 2.11 combined is
poly(n). Now note that if e < ^{9) then 7 = ^. On the other hand if e > ^{6) then 7 = ^ < ^. 

Since 9 UO" U O'" = 9, the corresponding estimate (with a little abuse of notation) 9 = §' U §" U 9"' , 
with probability greater than (1 — 5), is within e of 9 only up to a permutation using a sample of 

size poly (n, max (i, , □ 



Proof of Theorem 3.2:

Theorem 3.7 guarantees the existence of a $2k^2$-coordinate plane $S$ such that when $p_\theta$ is projected onto $S$, the corresponding mixture $\pi_S(p_\theta)$, parameterized by $P_S(\theta)$, satisfies $\mathcal{R}(P_S(\theta)) \geq \frac{\mathcal{R}(\theta)}{n}$. Since $S$ is not known in advance, projecting $p_\theta$ onto all $\binom{n}{2k^2}$ $2k^2$-coordinate planes,

each time invoking the algorithm of Theorem 12.81 with a sample of size poly (^jjj^^ {s/n'^) ' ^) 
using union bound ensures that for each 2fc^-coordinate plane T, Theorem 12.81 produces a value of 
parameters Pt{9) such that Pt{9) G M [Pt{9), ^) with probability greater than (1 — 5). Now for 
each such 2A;^-coordinate plane T, Lemma 13.51 guarantees that \MPt{9) — MPt{9)\ < Thus 

there must exist at least one 2fc^-coordinate plane (say T*) such that, SS{Pt^{9)) > ^{9)^ — 
Thus, 

me) >e)^ (^(^)) > ^) 

The desired algorithm now works as follows. For each of the (2^2) values of parameters Pt{9) 

outputted by Theorem 12. 81 we compute ^{Pt{9)) using Equation[6l Now set = maxT^(PT(0)). 
If ^* < then output ^{9) < e otherwise output ^{9) > e. □ 

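Proof of Theorem 3.7:

Lemma B.1 establishes the existence of a $k^2$-coordinate plane $S_1$ such that $\forall i, j$, $\|P_{S_1}(\mu_i) - P_{S_1}(\mu_j)\|^2 \geq \frac{1}{n}\|\mu_i - \mu_j\|^2 \geq \frac{1}{n^2}\|\mu_i - \mu_j\|^2$. Similarly, Lemma B.2 establishes the existence of a $k^2$-coordinate plane $S_2$ such that $\forall i, j$, $\|P_{S_2}(\Sigma_i) - P_{S_2}(\Sigma_j)\|^2 \geq \frac{1}{n^2}\|\Sigma_i - \Sigma_j\|^2$. Taking the span of these two planes produces a $2k^2$-coordinate plane $S = \mathrm{span}\{S_1, S_2\}$ such that

$$\min_{i \neq j}\left(\|P_S(\mu_i) - P_S(\mu_j)\|^2 + \|P_S(\Sigma_i) - P_S(\Sigma_j)\|^2\right) \geq \frac{1}{n^2}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right).$$

Note that the radius of identifiability of $\pi_S(p_\theta)$, parameterized by $P_S(\theta)$, is given by

$$(\mathcal{R}(P_S(\theta)))^2 = \min\left(\frac{1}{4}\min_{i \neq j}\left(\|P_S(\mu_i) - P_S(\mu_j)\|^2 + \|P_S(\Sigma_i) - P_S(\Sigma_j)\|^2\right),\ \min_i w_i^2\right) \geq \min\left(\frac{1}{4n^2}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right),\ \min_i w_i^2\right),$$

where the inequality follows from the fact that $\forall a_1, a_2, b$: $(a_1 \leq a_2) \Rightarrow (\min(a_1, b) \leq \min(a_2, b))$.

Case 1: $\frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) \leq \min_i w_i^2$.

Here $(\mathcal{R}(\theta))^2 = \frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$ and

$$(\mathcal{R}(P_S(\theta)))^2 \geq \frac{1}{n^2} \cdot \frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) = \frac{1}{n^2}(\mathcal{R}(\theta))^2.$$

Case 2: $\frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) > \min_i w_i^2$.

Here $(\mathcal{R}(\theta))^2 = \min_i w_i^2$.

If $\frac{1}{4}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right) > \min_i w_i^2 > \frac{1}{4n^2}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$, then

$$(\mathcal{R}(P_S(\theta)))^2 \geq \min\left(\frac{1}{4n^2}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right),\ \min_i w_i^2\right) \geq \frac{1}{n^2}\min_i w_i^2 = \frac{1}{n^2}(\mathcal{R}(\theta))^2.$$

On the other hand, if $\min_i w_i^2 \leq \frac{1}{4n^2}\min_{i \neq j}\left(\|\mu_i - \mu_j\|^2 + \|\Sigma_i - \Sigma_j\|^2\right)$, then

$$(\mathcal{R}(P_S(\theta)))^2 \geq \min_i w_i^2 = (\mathcal{R}(\theta))^2 \geq \frac{1}{n^2}(\mathcal{R}(\theta))^2. \qquad □$$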



D Moment Concentration 

Lemma D.l. LetpQ,6 € C M™ &e a m-parametric family of probability distributions in where 
Q is contained in a ball of radius B in and let Xi,X2, ■ ■ ■ ,Xm be iid random vectors drawn 
frompg. Suppose the moments Mi^^^^i^{6) = J x^^ . . . x^^dpe and the corresponding empirical moments 

Mi^,„i^{e) = ^j'"' are lexicographically ordered as Mi{9), M2i9), . . . and Mi{9) , M2{9) , . . . 



respectively. Then given any positive integer N , and sample size M > ^^j;^' where C IS a 
constant, for any e > and < 5 < 1, \Mi{9) — Mi{9)\ < e for all i < N with probability greater 
than 1 — 6. 

PROOFiFor any i < N, let Mi{9) = J I dpg, where aj{i) is a function of i for 

j = 1, 2, Let /j : — )• M be a function defined as fi{x) = x^^^*^X2^^*^ . . . x^'^*\ For any random 
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vector X distributed according to pg, we have E[/j(X)] = Mi{6). The empirical counterpart is 

--nUiX)]. Now 



defined as ^''='1:^^'^ = Mi{6). Note that E 1^^=^^^^^^^ 



M 



M 



Var ' ^=i^'(^^-) 



M 



Var(/»(X)) 
M 



^¥.{u{x)-nm))y 



= ^i^{mx)?)-{nh{x)]f 

<^{E([/.(X)p)} 

- M J -^l 



2ai(i) 2a2(i) 



. . .X 



2a, (j) 



dpg 



< 



CB^ 



M 



where the last inequality follows from the fact that when the moments are lexicographically ordered, 



for any i < N, the maximum degree of the polynomial x'^^^^'^ x'^^^'^^ . . . x^^'^^^ is at most [-y] . 



Now applying Chebyshev's inequality we get, 



P (\Mii9) - Mi{e)\ >€)=P 



M 



nfiix)] 



> € 



Var 



< 



< 



Upper bounding the last quantity by and using union bound yields the desired result. 



□ 
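The last step of this argument, Chebyshev's inequality followed by a union bound, translates into a simple sample-size rule. The sketch below is generic rather than the lemma's exact bound: it assumes a known upper bound `variance_bound` on $\mathrm{Var}(f_i(X))$ for every moment of interest and splits the failure probability evenly as $\delta / N$ across the $N$ moments; both the even split and the function name are assumptions of this illustration.

```python
import math

def chebyshev_sample_size(variance_bound, eps, delta, num_moments=1):
    """Smallest M with variance_bound / (M * eps^2) <= delta / num_moments.

    By Chebyshev's inequality each empirical moment then deviates from its
    expectation by at least eps with probability at most delta / num_moments,
    and a union bound over the moments gives total failure probability
    at most delta.
    """
    per_moment_delta = delta / num_moments
    return math.ceil(variance_bound / (eps ** 2 * per_moment_delta))

M = chebyshev_sample_size(variance_bound=2.0, eps=0.5, delta=0.25, num_moments=4)
```

With these inputs the per-moment failure probability is 0.0625, and the rule gives the smallest sample size for which the Chebyshev bound meets it.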
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